1.0 INTRODUCTION


1.1  What  is  Phase  Change? 

In chemistry, “phase change” denotes the transition of a mixture from one phase to
another without a change in chemical composition. For example, ice melting into water
undergoes a phase change. In mathematics, “phase” is used to denote the position of a
periodic signal within its cycle. “Phase change” is often used in mathematics to mean
that a periodic signal has changed from monotonically increasing to monotonically
decreasing, or vice versa. For example, when the angle of a sine function increases from
0 to π/2, its value increases from 0 to 1. As the angle increases beyond π/2, the value of
sine begins to decrease. Thus, at the point π/2, a sine function is said to undergo a
“phase change”.

In our study, it is the latter, mathematical definition that is more applicable. When we
refer to “phase change” we mean that a group has undergone a fundamental change in
philosophy or attitude. Philosophical changes include shifts in ideology, intent, and
sentiment. These changes will often manifest themselves as changes in action, such as a
formerly peaceful group suddenly adopting violence as part of its dogma. The changes
include agreements to ceasefires, violations of ceasefires, or desperate actions in a losing
fight, such as a sudden adoption of suicide bombing techniques.

We explore the notion of phase change. Specifically, this report attempts to answer the
question: is there some signal that can be monitored in the writings of a group that will
indicate an imminent change of phase? To answer this question, we gathered information
on world organizations advocating regime change, both violent and peaceful. We
selected a group for which we had evidence of phase changes (e.g., transitions from
non-violence to violence, upswings in violence, or changes in tactics) and sufficient
data for analysis. The writings of the group were collected and analyzed through use of
Self-Organizing Maps (SOMs). Various signatures, including signatures of phase change,
were found in the data.

1.2  Other  Work  in  this  Area 

In previous work, Allison Smith [5] studied the psychological characteristics and
rhetoric of organizations operating in the same time and place. These analyses were
based on matched pairs in which each pair consisted of one future violent group and one
non-violent group. She studied documents issued by seemingly similar group pairs to
determine whether the future terrorist acts of the violent group could have been
predicted. She rated groups on variables such as “power motive” and “practicality value”.
These values were somewhat subjective and manually coded by a group of scorers. She
asserted that content analysis could be used as a predictor of future behavior.

Stephan Green [1] took a similar approach to Smith, but concentrated on detecting
sentiment in text. While Green’s approach was largely automated and achieved results
much better than baseline, it was based upon grammatical relationships which were
defined a priori.

Steven Shellman [2-4] has successfully modeled phase change of terrorist organizations
based on measures such as cooperative and conflictive actions. Unlike the other studies
mentioned, he concentrated on actions, rather than intent. His work was based upon
news articles about groups, rather than the writings of the groups themselves.

Our work differs from these approaches in several ways. First, we assume nothing a
priori about the group’s use of language. We do not manually look for sentiment in text,
for instance. Second, other than a manual inspection of a word and phrase list to
eliminate redundancies and the selection of two or three writings of interest, our method
is completely automated. These activities, too, could be automated in the future. Finally,
our training method is unsupervised. We let the algorithm decide what is important in
the text to determine how a document should be categorized.

1.3 Selection of Study Group

Many groups were identified as good candidates for this study. Ideally we wanted to
compare the analysis of a group that underwent a phase change with a similar group
operating in the same general area and time with similar goals that did not exhibit a
phase change. Several potential group pairs were obtained from Steve Shellman and
literature such as Allison Smith’s dissertation. Many of the control groups listed in
Allison Smith’s dissertation were later determined by the United States (US) government
to be funneling monies to terrorist organizations and thus could not be considered as
controls. Many others, such as the Palestinian Authority, quit writing when they achieved
many of their goals, and their writings are not archived. Many of the Indonesian groups
do not write in English. When one passes the groups through these filters, not many good
candidates remain. We did identify a group pair that looked to be an excellent candidate
based on language, activities, and quantity of writing. This pair consisted of Operation
Save America (non-violent US abortion activists) and Operation Rescue (sometimes
violent US abortion activists). But due to Sandia’s close relationship with the
government, we were cautioned away from this option since it involved the direct study
of US citizens.

It was decided to pursue the Liberation Tigers of Tamil Eelam (LTTE) of Sri Lanka,
commonly referred to as the Tamil Tigers, as a study group. For more than 25 years, the
LTTE has been engaged in a civil war with the government of Sri Lanka. The island of
Sri Lanka is approximately 80% Buddhist, and 12% Tamil Hindu. Many Hindus feel that
they are persecuted by the Buddhist majority. The Tamils were peaceful for decades.
Then, in July 1983 there were several days of rioting where it was reported that Buddhist
Sinhalas attacked and killed as many as 3,000 of the Tamil minority. The government
and the police were accused by many news organizations of turning a blind eye. This
event allowed the LTTE to gain power and begin their quest for Tamil independence.
They instigated many types of activities that were later adopted by Al Qaeda and others.
For instance, the LTTE was reported to be the first terrorist group to use female suicide
bombers.

In early 2002, an uneasy ceasefire was signed between the LTTE and the Government of
Sri Lanka (GoSL). There were many phase changes exhibited by the Tigers before the
ceasefire fully fell apart four years later. In addition, there was also a brief ceasefire in
late 2000. Thus, the period from 2000-2006 would be good to examine. The LTTE
maintained a web site in English directly attributed to them, which contained not only
news stories, but dogma and sentiment as well. News on the site was highly influenced
by the underlying tenets of the LTTE. Articles on the LTTE web site were archived for
the period of time in question. There were large Tamil diaspora communities in Canada,
the UK, and Singapore. Many of them were sympathetic to the LTTE and also wrote in
English. Steve Shellman suggested there was a sister group, the Tamil United Liberation
Front (TULF), that would be a good control. The LTTE seemed to have all the
characteristics we were looking for. Thus, we selected them as our study group.
Unfortunately, we later found that the TULF either disbanded or was absorbed by the
LTTE and their writings were not archived.

The remainder of this report is organized as follows. The next section discusses the
collection of web documents containing writings by the LTTE. The following section
discusses the analysis of these writings using a SOM technique. Next, the SOM analysis
is repeated using various statistical techniques to determine the mathematical merit of the
data set. Finally, a summary and conclusions are provided.

2.0 DATA COLLECTION


There were two issues involved in the data collection for this project. The first was the
accumulation of potential writings that may have been relevant. The second involved
sifting through these documents to determine whether or not they were useful.

Data collection for the project was conducted on a home computer using a free
web-based spider. Seeded with Uniform Resource Locators (URLs) of what we knew to be
LTTE writings, the spider collected thousands of documents. Many of these were about
the Tigers. However, if an article about the Tigers compared Tamil persecution to the
Jewish holocaust, the spider would then start collecting articles about the Jewish
holocaust. Thus many collected articles were not about the Tigers. For operational
security reasons, we chose not to use the Stanley spider. Had we done so, we might
have narrowed the collection down to a small number of potentially useful documents.
Instead, this operation largely had to be done manually.

There  are  a  number  of  difficulties  incurred  in  the  attribution  of  writings  to  a  particular 
group  or  even  a  sympathizer.  As  an  example  unrelated  to  the  LTTE,  in  the  initial  search 
for  appropriate  study  groups  a  blogger  was  uncovered  who  posted  many  articles  in  2008. 
He claimed to be from Jemaah Islamiyah and asserted the group was endorsing Barack
Obama for president. After a couple of months, the blogger admitted that he was not from
Jemaah Islamiyah, but was an American trying to sway public opinion away from Obama.
There  are  many  examples  of  false  writings  such  as  this  on  the  web.  Had  we  used  the 
Stanley  spider,  it  is  unlikely  it  would  have  flagged  these  early  blogs  as  inappropriate. 
There  would  have  been  some  human  analysis  required  even  in  the  best  case  scenario. 

Much  of  the  data  collected  from  the  free  spider  were  news  articles  posted  in  the  various 
Tamil  diaspora  sites  around  the  world.  After  a  manual  search,  we  were  able  to  obtain  a 
small  nucleus  of  documents  that  we  were  either  certain  the  Tigers  had  written,  or  they 
were  clearly  written  by  Tiger  sympathizers.  Roughly,  there  were  100  files  we  knew  the 
Tigers  had  written.  These  were  used  for  training  and  testing  purposes.  There  were 
another  40  documents  used  solely  for  testing  which  included  not  only  articles  from  Tamil 
Tiger  sympathizers,  but  control  documents  such  as  a  posting  from  the  Sri  Lanka 
government  and  an  article  about  Tamil  cuisine.  Ultimately,  we  would  have  liked  to  have 
collected  more  relevant  documents.  However,  we  felt  we  had  enough  to  perform  a 
preliminary  analysis  and  produce  some  useful  results. 

Another  complication  made  data  collection  more  difficult.  In  early  2009,  the  LTTE 
suffered  major  setbacks  in  their  war  with  Sri  Lanka.  In  January,  the  LTTE  capital  of 
Kilinochchi  fell  to  the  Sri  Lankan  government  army.  A  few  weeks  later,  the  main  LTTE 
website,  where  much  of  the  data  was  collected  for  this  study,  was  taken  offline.  A  partial 
archive  had  been  made  by  an  independent  group  and  was  accessed,  but  it  was  poorly 
indexed.

3.0 INITIAL ANALYSIS VIA SOM

A  SOM  analysis  was  selected  to  be  the  analysis  tool.  We  ran  the  SOM  on  a  nucleus  of 
available  data  to  see  if  it  would  produce  any  useful  results.  That  study  is  described  in 
detail  below. 

3.1  Data  Sets 

For  the  training  data,  63  web  pages  attributed  to  the  Tamil  Tigers  were  used.  The  web 
pages  contained  a  variety  of  sentiments,  including  party  propaganda,  news  reports, 
speeches  from  the  LTTE  leader,  discussions  of  the  peace  process,  childhood  welfare, 
denials,  accusations,  and  ultimatums.  In  the  training  set,  the  contents  of  two  web  pages 
could  be  strongly  connected  to  a  subsequent  attack  by  the  Tamil  Tigers.  For  example,  in 
one  posting,  a  mention  that  the  LTTE  was  going  to  intensify  its  struggle  was  followed  by 
three  attacks  over  the  following  weeks  during  a  period  of  time  when  a  ceasefire  was 
supposedly  in  place. 

The  testing  data  set  consisted  of  33  web  pages  taken  from  various  sites  with  strong 
connections  to  the  LTTE,  or  from  news  sites  containing  interviews  with  LTTE  leaders. 
Some  pages  demonstrating  support  for  the  LTTE  but  lacking  a  firm  connection  to  the 
Tigers  were  used,  such  as  blog  entries  or  editorials.  For  control,  documents  which  were 
obviously  unrelated  were  used,  such  as  a  web  page  that  discussed  Tamil  cuisine. 

In a real scenario, one would like to ensure that documents truly attributable to the group
of interest were used for training purposes. This requirement would relax once the tool
is trained. In a typical day-to-day situation, one may not have time to verify the origins of
an incoming stream of data. Thus we included testing data from various sources.

3.2  Data  Processing 

A  self-organizing  map  approach  was  selected  for  this  study  due  to  its  long  reputation  as  a 
useful  pattern  recognition  tool.  The  textual  data  needed  to  be  processed  to  a  format  that 
would  be  useful  to  MATLAB  (MATrix  LABoratory),  the  program  used  to  run  the  SOM 
algorithms.  The  web  pages  were  processed  through  the  Sandia  Analyst  Aide,  resulting 
in  a  matrix  containing  key  words  or  phrases  for  each  web  page  in  the  training  data.  The 
result  was  approximately  5000  key  words  for  the  training  data  set.  Redundancies  were 
eliminated  and  the  list  was  whittled  down  to  approximately  1400  words  or  root  words. 

For  example,  “violent”  and  “violence”  are  words  that  express  the  same  sentiment,  and 
can  both  be  expressed  by  the  root  “violen”. 

A  MATLAB  script  was  written  to  compare  the  words  in  the  web  pages  to  the  list  of  key 
words,  resulting  in  a  numerical  matrix  which  expressed  the  number  of  times  a  word,  root 
word,  or  phrase  appeared  in  each  web  page.  Finally,  the  matrix  was  normalized  so  that 
all  the  web  pages  would  carry  the  same  weight. 
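
The counting and normalization step can be illustrated with a minimal Python sketch. It is not the MATLAB script used in the study. The prefix matching rule, the unit-sum normalization, and the names count_matrix and normalize_rows are assumptions made for illustration, since the report does not specify the normalization actually used.

```python
import re

def count_matrix(pages, keys):
    """Return one row per page holding the count of each key (word, root, or phrase)."""
    rows = []
    for text in pages:
        lowered = text.lower()
        # Assumption: a key matches at the start of a word, so the root "violen"
        # counts both "violent" and "violence"; phrases are matched literally.
        rows.append([len(re.findall(r"\b" + re.escape(key.lower()), lowered))
                     for key in keys])
    return rows

def normalize_rows(rows):
    """Scale each row to sum to 1 so long and short pages carry the same weight."""
    normalized = []
    for row in rows:
        total = sum(row)
        normalized.append([count / total if total else 0.0 for count in row])
    return normalized

# Toy pages and keys, invented for illustration only.
pages = [
    "Violent acts and more violence followed.",
    "The peace process continued; peace talks resumed.",
]
keys = ["violen", "peace", "peace process"]
counts = count_matrix(pages, keys)
weights = normalize_rows(counts)
```

Here each row of weights is the feature vector for one web page; other normalizations (for example, unit Euclidean length) would serve the same purpose.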



Distribution A: Approved for public release; distribution is unlimited. 88ABW-2011-2160, 13 Apr 2011.


Legacy  code  was  utilized  in  the  analysis  of  the  resulting  matrix  data.  A  modified  SOM 
developed  for  another  project  was  employed.  By  itself,  a  SOM  will  not  classify  data,  but 
it  allows  for  the  visualization  of  multidimensional  data  in  a  lower  dimensional  space, 
while  still  preserving  the  distance  relationships  among  the  data  points.  For  example,  if 
the  parameter  space  has  14  dimensions,  after  processing  with  a  SOM,  data  points  which 
were  far  away  from  each  other  in  14  dimensional  space  will  still  be  far  apart  in  2 
dimensional  space.  Because  of  the  dimensionality  reduction  involved,  the  SOM  mapping 
is  not  unique.  That  is,  one  data  point  in  the  finished  SOM  will  likely  represent  more  than 
one  data  point  in  the  original  data  set.  Based  on  the  eigenvalues  of  our  training  data  set, 
the  SOM  algorithm  came  up  with  a  map  that  had  dimensions  of  5x8  cells.  That  is,  each 
web  page  could  be  mapped  to  one  of  40  locations  in  the  SOM. 
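
For readers unfamiliar with the technique, the following Python sketch shows a basic SOM: train a small grid of weight vectors, then map each input to its closest cell. It is a textbook version, not the modified legacy MATLAB code used in this study. The grid size is chosen by hand rather than from eigenvalues, and the decay schedules, the toy data, and the names train_som and best_cell are all assumptions.

```python
import math
import random

def best_cell(som, x):
    """Return the grid cell whose weight vector is closest to x (best-matching unit)."""
    return min(som, key=lambda cell: math.dist(som[cell], x))

def train_som(data, rows, cols, epochs=50, seed=0):
    """Train a rows x cols SOM; returns a dict mapping (row, col) to a weight vector."""
    rng = random.Random(seed)
    # Initialise every cell with a copy of a randomly chosen training vector.
    som = {(r, c): list(rng.choice(data)) for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        frac = epoch / epochs
        rate = 0.5 * (1.0 - frac)                              # learning rate decays
        radius = 0.5 + (max(rows, cols) / 2.0) * (1.0 - frac)  # neighbourhood shrinks
        for x in rng.sample(data, len(data)):                  # shuffled pass over data
            br, bc = best_cell(som, x)
            for (r, c), w in som.items():
                grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                h = math.exp(-grid_d2 / (2.0 * radius ** 2))   # Gaussian neighbourhood
                for i in range(len(w)):
                    w[i] += rate * h * (x[i] - w[i])           # pull weights toward x
    return som

# Toy data: two well-separated groups of 4-dimensional vectors (invented).
rng = random.Random(1)
group_a = [[rng.uniform(0.0, 1.0) for _ in range(4)] for _ in range(10)]
group_b = [[rng.uniform(9.0, 10.0) for _ in range(4)] for _ in range(10)]
som = train_som(group_a + group_b, rows=2, cols=3)
cell_a = best_cell(som, group_a[0])
cell_b = best_cell(som, group_b[0])
```

With only 6 cells for 20 inputs, several inputs necessarily share a cell, which is the non-uniqueness described above. One would expect vectors from the two toy groups to land in different cells, although the sketch does not guarantee it.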

The  resulting  SOM  representation  can  be  classified  to  put  similar  data  into  groups.  The 
legacy  code  used  allows  the  user  to  “seed”  the  classifier  with  the  location  of  data  points 
that  are  clearly  different.  For  instance,  a  web  page  discussing  child  protection  issues,  and 
a  web  page  expressing  accolades  for  a  suicide  bomber  should  clearly  belong  in  two 
different  groups.  A  modified  K-means  clustering  algorithm  uses  the  seed  locations  as 
starting  points  for  classification,  and  based  on  centroid  distances,  the  result  of  the 
clustering  technique  contains  at  least  as  many  groups  as  seed  files,  usually  more.  The 
user  then  has  the  opportunity  to  examine  the  results  and  merge  groups  if  desired. 
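
The seeding idea can be sketched in Python as ordinary K-means whose starting centroids are the user-supplied seed locations. The modification in the legacy code that produces additional groups is not described in this report, so it is not reproduced here. The function name seeded_kmeans and the toy coordinates are assumptions.

```python
import math

def seeded_kmeans(points, seeds, iters=20):
    """K-means whose initial centroids are user-chosen seed points."""
    centroids = [list(seed) for seed in seeds]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Toy (row, col) map locations of eight pages; the three seeds are hypothetical.
points = [(0, 0), (0, 1), (1, 0), (4, 7), (4, 6), (3, 7), (0, 7), (1, 7)]
seeds = [(0, 0), (4, 7), (0, 7)]
labels, centroids = seeded_kmeans(points, seeds)
```

Seeding fixes which cluster is which from run to run, so a label such as "the ultimatum group" stays attached to the cluster containing the ultimatum seed.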

In this case, three seed file locations were provided: a web page containing an ultimatum,
a web page discussing children’s issues, and a web page in which the LTTE denies having
committed an act the government attributed to them. Based on the locations of the
data centroids, the classification algorithm produced 5 distinct groups. The U-matrix
(unified  distance  matrix)  is  shown  on  the  left  (Figure  1).  The  U-matrix  visualizes  the 
distances  between  adjacent  units  in  the  SOM.  Red  and  brown  represent  areas  where  the 
distances  between  the  nodes  are  large,  and  blue  areas  are  where  the  distance  is  small. 
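
The distance idea behind a U-matrix can be sketched in Python as follows. This is a simplified variant that stores, for each cell, the mean distance to its horizontally and vertically adjacent cells. A full U-matrix also includes entries between units, and the plotting code used in the study is not shown in this report. The function name u_matrix and the toy grid are assumptions.

```python
import math

def u_matrix(grid):
    """grid[r][c] is a weight vector; return the mean distance to adjacent cells."""
    rows, cols = len(grid), len(grid[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    dists.append(math.dist(grid[r][c], grid[nr][nc]))
            # A 1x1 grid has no neighbours; report 0.0 in that case.
            out[r][c] = sum(dists) / len(dists) if dists else 0.0
    return out

# Toy 2x3 map: the right-hand column lies far from the other two columns.
grid = [
    [[0.0, 0.0], [0.0, 1.0], [5.0, 1.0]],
    [[0.0, 0.0], [0.0, 1.0], [5.0, 1.0]],
]
u = u_matrix(grid)
```

Large values mark boundaries between dissimilar regions of the map (the red and brown areas), and small values mark homogeneous regions (the blue areas).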

The U-matrix gives one an idea of how many natural clusters are present in the data.

Figure 1: Left: U-matrix of SOM, Right: Resulting Groups after Clustering

Generally  speaking,  the  data  that  map  to  the  blue  areas  are  not  as  linguistically  diverse  as 
the  data  that  map  to  the  red  and  brown  areas.  The  right  side  of  the  figure  represents  the 
result  of  the  clustering.  Given  three  seed  points,  five  distinct  groupings  resulted.  The 
colors  and  numbers  attributed  to  each  class  have  no  inherent  meaning.  When  compared 
to  the  training  data,  the  following  pattern  emerges: 

1)  The  dark  blue  cluster  was  essentially  a  potpourri  of  files  that  didn’t  fit  well  into 
other  groups.  An  editorial  about  the  aims  of  the  new  Sri  Lankan  president  and  a 
discussion  of  ethnic  cleansing  fell  into  this  group.  Note  that  comparison  with  the 
U-matrix  reveals  that  this  class  contained  the  most  diverse  data. 

2)  The  light  blue  cluster  contained  files  with  inflammatory  language,  ultimatums, 
and  many  speeches  from  the  LTTE  leader.  Both  web  pages  connected  to  a 
subsequent  LTTE  attack  fell  into  this  data  set. 

3) The green cluster contained pages that could be largely classified as news articles,
that is, reporting of events rather than commentary.

4) The orange cluster contained files that discussed children’s issues, humanitarian
issues, and the peace process; in general, things not related to attacks.

5)  The  brown  cluster  contained  files  where  the  LTTE  was  either  reacting  to  actions 
taken  by  the  Sri  Lankan  government,  or  the  LTTE  was  denying  attacks  that  others 
had  said  they  had  executed. 

It  is  also  interesting  to  note  that  files  related  to  the  peace  process  (group  4),  and  files 
related  to  ultimatums,  etc.,  (group  2),  were  the  least  linguistically  diverse. 

There were a few articles in this set that on the surface appeared to be inflammatory, and
the SOM did not place them in the inflammatory group (2). However, when they were
correlated with the timeline of events, one could see that the Tigers committed no attacks
that year. So perhaps these articles were not misclassified. There were also a number of
web pages classified in the inflammatory group that did not precede an actual event.

Perhaps the most interesting finding from the training data set had little to do with phase
change on the part of the LTTE. We found a couple of examples where an LTTE web page
placed in the peaceful grouping (4) preceded an attack by the other side. So perhaps
peaceful dialog on the part of the LTTE is perceived by the Sri Lankan government as a
sign of weakness.

3.3 Results using Test Data Unseen by the SOM

The  testing  data  came  mostly  from  sources  other  than  the  official  LTTE  web  sites  and  did 
not  adhere  as  strictly  to  the  nice  groupings  above;  however,  four  documents  did  come 
from the official LTTE web sites. The following results were obtained:

a)  There  were  three  files  that  discussed  children’s  issues  or  the  2004  tsunami.  They 
all  mapped  to  group  (4). 

b)  The  article  on  Tamil  cuisine  mapped  to  group  (3),  the  news  article  group.  While 
we  were  surprised  this  didn’t  map  to  the  potpourri  group,  at  least  the  SOM  didn’t 
think  Tamil  cuisine  was  an  imminent  threat  to  peace. 

c)  There  were  four  editorials  from  Tamil  supporters  in  Canada.  They  all  mapped  to 
group  (4),  the  children’s  issues  and  peace  group. 

d)  There  were  8  web  pages  of  interviews  with  the  leader  of  the  LTTE.  These  all 
mapped  to  groups  (2)  and  (3).  However,  the  outcomes  here  could  be  biased  by 
the  text  in  the  interviewer’s  questions.  It  might  be  useful  to  strip  off  the 
interviewer’s  questions  and  reprocess  this  text. 

e)  An  article  about  female  LTTE  cadres  from  the  same  site  where  the  training  data 
was  obtained  mapped  to  group  (3). 

f)  There  were  3  blogs  from  Tamil  supporters.  Two  mapped  to  group  (3),  one  to 
group  (4). 

g) One article was taken from an official web site of the Sri Lankan government.
This article denied the government’s involvement in acts the LTTE attributed to
it. The web page mapped to group (5), the denial group.

A few interesting notes: there were no files that mapped to the potpourri group. In
addition, only files attributable to the LTTE leader mapped to group (2), the
inflammatory speech group.

3.4 What Can We Generalize From This?

The  SOM  analysis  technique  seemed  to  do  a  good  job  at  spotting  polarized  writings 
(violent/peaceful)  even  though  examination  of  the  individual  pages  revealed  that  actual 
word  use  was  not  a  good  indicator.  That  is,  when  a  file  that  mapped  to  group  (2)  was 
compared  with  one  that  mapped  to  group  (4),  there  was  no  one  word  or  small  subset  of 
words  that  emerged  as  a  discriminator.  While  that  is  useful,  we  are  more  concerned  with 
the  predictive  nature  of  the  SOM.  The  underlying  problem  is  relating  the  time  line  to  the 
data  set. 

A subjective assessment of the data collected was conducted by the project authors, who
are experts in various areas of information retrieval and pattern recognition but are not
trained intelligence analysts. Many terrorist acts were attributed to the LTTE by the Sri
Lankan government without apparent supporting evidence, and the LTTE flatly denied
having performed them. We tried to concentrate on actions such as political
assassinations and suicide bombings that were very obviously carried out by the Tigers.
Two web pages were manually determined, based on their sentiment and timing, to
correlate strongly with subsequent LTTE attacks after a peaceful period. These articles
both mapped to group (2). No article outside group (2) could be correlated with a
subsequent terrorist event.

In  addition,  the  SOM  seems  to  give  us  an  idea  if  a  document  expressing  violent  sentiment 
was  actually  written  by  the  Tigers.  Although  many  of  the  sympathizer  documents  in  the 
testing  data  set  exhibited  hateful  writing,  none  of  them  were  actually  mapped  by  the 
SOM  to  the  dangerous  group  (2).  These  results  would  seem  to  suggest  that  if  a  web  page 
actually  mapped  to  group  (2),  we  could  be  confident  it  was  authored  by  the  Tigers,  and  be 
relatively sure that they mean business.

4.0 STATISTICAL ANALYSIS OF DATA SET AND SOM TECHNIQUE

Due to the limited size of the data set, analysis via bootstrapping and jackknifing
techniques was conducted to support the credibility and impact of the research.

4.1 Bootstrapping

In  bootstrapping  analysis,  some  number  N  of  samples  from  the  original  data  set  is  drawn 
at  random,  with  replacement,  and  the  functional  estimate  (in  this  case  a  SOM)  is 
performed.  The  process  is  repeated  M  times.  The  results  are  analyzed  to  determine 
useful  statistics  such  as  confidence  intervals  in  the  case  of  fuzzy  sets  or  percentiles  in  the 
case  of  discrete  sets.  It  is  the  nature  of  the  bootstrap  that  every  run  will  produce  slightly 
different  results.  Since  the  samples  are  drawn  with  replacement,  there  is  a  small  risk  of 
running  into  invertibility  problems  if  too  many  copies  of  the  same  sample  are  used. 
Fortunately,  that  problem  did  not  occur  in  our  experiments. 

We selected a value of 63 for N, which was the number of files used in the original SOM
analysis described above. We had a total of 100 files (web pages) that we could directly
attribute to the LTTE. All 100 files were used for the potential training set. That is, 63
files, with replacement, were selected from a total data set of 100.

M  was  selected  to  have  a  value  of  200.  Thus,  200  different  draws  of  63  data  files  each 
were  run  through  the  SOM. 

The  SOM  training  was  just  one  facet  of  the  algorithm.  It  was  also  necessary  to  cluster  the 
resulting  U-matrix  into  groups.  As  described  in  the  previous  section,  in  order  to  perform 
the  clustering,  the  user  would  provide  a  number  of  “seed”  files  from  the  data  which 
should  clearly  belong  to  separate  classes.  The  requirement  presents  a  dilemma. 
Technically,  all  N  samples  from  the  original  data  set  should  be  selected  at  random.  If  we 
pre-select  a  number  of  files  (three  in  this  case),  we  are  violating  that  rule.  However,  the 
seed  files  should  ideally  be  from  the  training  set,  not  the  testing  set.  Ultimately,  we  made 
the  decision  to  run  the  bootstrapping  analysis  twice.  The  first  200  runs  were  with  the 
three  seed  files  being  forced  to  be  part  of  the  training  set  (constrained).  In  the  remaining 
200 runs the seed files, due to the nature of the random selection, would sometimes be in
the  training  set  and  would  sometimes  be  in  the  testing  set  (unconstrained).  Since  the 
SOM  may  or  may  not  have  seen  the  seed  data  during  the  training  phase,  we  would  expect 
the  results  of  the  second  set  of  bootstrapping  runs  to  be  slightly  worse. 
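
The two sampling schemes can be sketched in Python as follows. The sketch covers only the resampling, not the SOM training. It assumes that in the constrained case the three seed files occupy 3 of the 63 slots, and that the testing set for a run is whatever was never drawn; the report states neither detail. The seed indices and the name bootstrap_draw are hypothetical.

```python
import random

def bootstrap_draw(n_files, n_draw, rng, forced=()):
    """Draw n_draw file indices with replacement; indices in `forced` are always included."""
    forced = list(forced)
    # Assumption: forced files take up slots, so the training set still has n_draw entries.
    drawn = [rng.randrange(n_files) for _ in range(n_draw - len(forced))]
    train = forced + drawn
    chosen = set(train)
    test = [i for i in range(n_files) if i not in chosen]  # files never drawn
    return train, test

rng = random.Random(0)
seed_files = (3, 41, 77)  # hypothetical indices of the three seed files
constrained = [bootstrap_draw(100, 63, rng, forced=seed_files) for _ in range(200)]
unconstrained = [bootstrap_draw(100, 63, rng) for _ in range(200)]
```

In the unconstrained runs a seed file falls outside the training set whenever none of the 63 draws selects it, which is why those runs may be asked to place a seed file the map has never seen.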

Different data have different eigenvalues. Since the size of the SOM is based on the
eigenvalues of the data, every bootstrapping run may potentially produce a SOM of a
different size. In addition, the seed files, while constrained to be in different groups, will
be in different locations from run to run. There may also be a different number of groups
resulting from each run, depending on the data files selected. Many draws of the same
data file for the training set will produce a less diverse SOM. Given all these realities,
how do we measure the results?

There were a number of files that, based on manual interpretation and the early SOM
analysis, clearly belonged in a certain group. For example, the inflammatory writings
that preceded an attack definitely belong only in the group with other inflammatory
writings. Writings that are not attributable to the Tigers should never belong in the
inflammatory group. The seed set for clustering consisted of three files. The resulting
number of groups may be three, but it may be five or six. Since the “extra”
groups change from run to run, we cannot say much about them with certainty. However,
if we restrict ourselves just to the three seed groups, and data that clearly belong to one of
them, we have a way to measure the results. For example, were the files that clearly
belong in the inflammatory group actually placed there in all the runs?

A subset of 12 files, from both the training and testing sets, was selected that strongly
related to one of the three seed groups. Performance was measured by how often, out of
the 200 runs, each file was placed in its designated group. For demonstration
purposes, a file whose group was not so clear was also included. It is designated below
with an asterisk (*).

For  consistency,  the  seed  files  are  grouped  as  follows. 

Group 1: News or denial (the seed file reported an attack on a GoSL army truck by an
unidentified group)

Group 2: Humanitarian or social issues (the seed file was an article about children’s
issues)

Group 3: Inflammatory articles that were attributed to the LTTE (the seed file was an
ultimatum)

If  a  document  was  judged  by  the  algorithm  not  to  be  in  one  of  these  groups  a  portion  of 
the  time,  it  is  documented  below  as  other. 
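
The performance measure amounts to a tally of group placements over the runs. The Python sketch below shows one way to compute it; the function name placement_percentages and the per-run placements are invented for illustration and are not the study's data.

```python
from collections import Counter

def placement_percentages(assignments):
    """Return {group: percent of runs} for one file, most frequent group first."""
    counts = Counter(assignments)
    total = len(assignments)
    return {group: 100.0 * n / total for group, n in counts.most_common()}

# Hypothetical placements of one file over 200 runs.
runs = [3] * 187 + [2] * 7 + [1] * 6
result = placement_percentages(runs)
```

A placement outside the three seed groups could be recorded with the label "other" in the same list and would be tallied in the same way.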


We  would  expect  the  constrained  results  (Table  1)  to  be  slightly  better  than  the 
unconstrained  results  (Table  2).  In  the  constrained  case,  we  are  forcing  the  SOM  to 
include  the  clustering  seed  files  as  part  of  the  training  set.  In  other  words,  we  are  telling 
the  algorithm,  “I  want  you  to  put  these  three  files,  which  you  have  seen  before,  into 
different  groups.”  In  the  unconstrained  case,  we  are  telling  the  algorithm,  “You  may  or 
may  not  have  used  these  files  to  build  your  map,  so  they  may  be  unknown  to  you.  You 
may  not  have  a  location  on  your  map  that  perfectly  describes  them.  But  I  want  you  to  put 
them  in  different  groups.” 
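
The difference between the two cases can be illustrated with a short Python sketch. Several details here are assumptions rather than facts from the study: the training-set size of 63, sampling with replacement, the file names, and the choice to implement the constraint by reserving three slots for the seed files.

```python
import random

def draw_training_set(pool, n_train, seeds, constrained, rng):
    """Draw n_train files from pool with replacement for one bootstrap run.

    If constrained, the seed files are always included and the remaining
    slots are filled by random draws; otherwise every slot is random.
    """
    if constrained:
        return list(seeds) + rng.choices(pool, k=n_train - len(seeds))
    return rng.choices(pool, k=n_train)

rng = random.Random(0)                         # fixed seed for repeatability
pool = [f"file_{i:03d}" for i in range(100)]   # pool of 100 candidate files
seeds = ["file_000", "file_001", "file_002"]   # hypothetical seed file names
constrained = draw_training_set(pool, 63, seeds, True, rng)
unconstrained = draw_training_set(pool, 63, seeds, False, rng)
print(all(s in constrained for s in seeds))     # True by construction
print(all(s in unconstrained for s in seeds))   # depends on the random draw
```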
Table 1: Bootstrapping Results (Outlining Portion of Runs SOM Actually Placed File
with Presumed Group, Total Number of Runs = 200), Seed Group was Part of
Training Set (Constrained)

                                                         Presumed group    Actual group(s)   Percent of runs
Data file description                                    via manual and    found by          in each group
                                                         old SOM           bootstrapping
-------------------------------------------------------  ----------------  ----------------  ---------------
Training data
Ultimatum preceding attack                               3                 3                 100
News about LTTE working to peace                         1                 1                 100
Thimpu declaration (objectives of LTTE)                  1                 1/2               99.5/0.5
Article about peace                                      1                 1/2/other         76/24/1
Article about killing of humanitarian workers            2                 2/other/3/1       74/19.5/6/0.5
Speech from LTTE leader (2004), inflammatory language    1                 3/2/1             93.5/3.5/3
Speech from LTTE leader (1992), more conciliatory        1 or 3*           1/3/2             52.5/36.5/10.5

Testing data
Article about 2004 tsunami                               1                 1                 100
LTTE sympathizer                                         1                 1/2               95.5/4.5
Blog accusing Sinhala Buddhists of racism                1                 1/2/3             85.5/11/3.5
Tamil cuisine article                                    1                 1                 100
Denial of suicide attack by sympathizer                  1                 1/2               79/21

*  While  the  1992  speech  was  more  conciliatory  than  the  2004,  from  the  language  it  could 
go  either  way.  This  file  was  included  to  demonstrate  that  a  SOM  can  mimic  the 
uncertainty  in  human  decision  making. 
Table 2: Bootstrapping Results (Outlining Portion of Runs SOM Actually Placed File
with Presumed Group, Total Number of Runs = 200), Seed Group May or May Not
have been Part of Training Set (Unconstrained)

                                                         Presumed group    Actual group(s)   Percent of runs
Data file description                                    via manual and    found by          in each group
                                                         old SOM           bootstrapping
-------------------------------------------------------  ----------------  ----------------  ---------------
Training data
Ultimatum preceding attack                               3                 3                 100
News about LTTE working to peace                         1                 1/other           99/1
Thimpu declaration (objectives of LTTE)                  1                 1/2/other         98/1/1
Article about peace                                      1                 1/2               72.5/27.5
Article about killing of humanitarian workers            2                 2/other/3/1       70.5/18/11/0.5
Speech from LTTE leader (2004), inflammatory language    1                 3/1/2/other       91.5/6/2/0.5
Speech from LTTE leader (1992), more conciliatory        1 or 3*           1/3/2             52.5/39.5/7.5

Testing data
Article about 2004 tsunami                               1                 1/other           99/1
LTTE sympathizer                                         1                 1/2/3/other       95/2/2/1
Blog accusing Sinhala Buddhists of racism                1                 1/2/3/other       86/9/4.5/0.5
Tamil cuisine article                                    1                 1/other           99/1
Denial of suicide attack by sympathizer                  1                 1/2               73/27

(All  training  data  was  selected  from  a  possible  set  of  100  files.  Thus,  in  the  “training 
data”  results  in  the  tables  above,  the  file  specified  was  in  the  training  set,  but  may  not 
have  been  selected  by  the  algorithm  for  training  purposes.  For  this  reason,  we  would 
expect  the  algorithm  not  to  obtain  100%  correct  classification  for  all  files.  In  no  case  was 
the  “testing  data”  viewed  previously  by  the  SOM.) 
Overall, the SOM performance varied depending on which file it was given. It performed
perfectly in some situations, but worse in others. This suggests that SOM
performance may depend more on the quality of the data than on its quantity. It seemed
to do well at spotting ultimatums. In fact, for the two examples we have of statements
preceding an attack, the SOM put them in exactly the same cell (point on the SOM
map) 69% and 68% of the time in the constrained and unconstrained runs, respectively.
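
A co-location rate of this kind can be computed as the fraction of runs in which two documents share a best-matching cell. The following Python sketch is illustrative only: the function name and the (row, column) cell coordinates are assumptions, and the five runs shown are made-up values, not study data.

```python
def colocation_rate(cells_a, cells_b):
    """Fraction of runs in which two documents map to the same SOM cell."""
    if len(cells_a) != len(cells_b):
        raise ValueError("need one cell per run for each document")
    same = sum(a == b for a, b in zip(cells_a, cells_b))
    return same / len(cells_a)

# Hypothetical (row, col) cells for two documents over five runs.
doc_a = [(0, 0), (0, 0), (1, 2), (3, 3), (0, 1)]
doc_b = [(0, 0), (0, 0), (1, 2), (2, 3), (1, 1)]
print(colocation_rate(doc_a, doc_b))  # 3 of 5 runs agree -> 0.6
```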

In the case of inflammatory writings (group 3), 20 files from the training set of 100 were
placed in this group over 90% of the time and 27 files over 50% of the time in the
constrained run. In the unconstrained run, 20 files were placed in group 3 over 90%
of the time and 25 files over 50% of the time. Our initial SOM training exercise in the
previous section placed 23 files in this set. The bootstrapping exercise and the initial
SOM exercise would seem to be in agreement. None of the test files were placed in
group 3 with over 50% certainty. This finding was desirable, since group 3 was
intended to consist of inflammatory writings attributed only to the LTTE.
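
Counts like "files placed in group 3 over 90% of the time" follow from a simple threshold test on each file's share of runs. A minimal Python sketch, assuming a dictionary of per-file shares (the file names and values below are hypothetical):

```python
def count_above(fractions, threshold):
    """Count files whose share of runs in the target group exceeds threshold."""
    return sum(1 for f in fractions.values() if f > threshold)

# Hypothetical share of runs in which each file landed in group 3.
group3_share = {
    "doc_a": 1.00,
    "doc_b": 0.935,
    "doc_c": 0.62,
    "doc_d": 0.365,
    "doc_e": 0.0,
}
print(count_above(group3_share, 0.90))  # 2
print(count_above(group3_share, 0.50))  # 3
```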

As predicted, the bootstrapping analysis in which the seed files were not constrained to
be part of the training set (Table 2) performed more poorly. The
degradation in performance was noticeable, but not dramatic. It is also interesting to note
that, because of the random draw of data files from the pool, all
three seed files were part of the training data in only 18 of the 200 runs, or 9%. These
results indicate that if one were to acquire a new important document that should
start its own new group cluster, it may not be necessary to retrain the SOM.
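
As a rough plausibility check on the 9% figure, one can ask how often all three seed files would appear in a random draw. The Python sketch below rests on assumptions that this section does not confirm: that each run drew 63 files, uniformly and with replacement, from the pool of 100. Under those assumptions the expected share is about 10%, which is close to the 18 of 200 runs reported.

```python
from math import comb

def prob_all_included(pool_size, n_draws, n_special):
    """P(all n_special given files appear at least once) when n_draws files
    are drawn uniformly with replacement from pool_size files.

    Uses inclusion-exclusion over the number of special files that are missed.
    """
    return sum(
        (-1) ** j * comb(n_special, j) * ((pool_size - j) / pool_size) ** n_draws
        for j in range(n_special + 1)
    )

observed = 18 / 200                        # share reported in the text
expected = prob_all_included(100, 63, 3)   # 63 draws per run is an assumption
print(f"observed {observed:.3f}, expected under assumptions {expected:.3f}")
```

If the draw were instead made without replacement, the expected share would be noticeably higher, so the check is only as good as the sampling assumption.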

In general, the results obtained by the bootstrapping runs agreed with those obtained in
the initial SOM analysis. While it would be more desirable if the SOM mapped all files
to the same classes at all times, in this study we were primarily concerned with the
inflammatory writings. In this area the SOM did quite well. Most of the other articles
covered a wide spectrum of topics. It is possible that many of the other articles would not
normally be given to a system looking for phase changes.

4.2  Jackknifing 

In jackknife analysis, the statistics of the original data set are recomputed with one sample
left out of the set. That is, if a set contains N data samples, the statistics are computed N
times with N-1 data samples each time. In each computation, a different sample is left
out. To be consistent with the original SOM, 63 different realizations of the SOM were
trained, using 62 files each. No training file was repeated in the same run. In all but 3 of
the SOM realizations, all 3 seed files were present in the training set. Running the same
analysis as described in the bootstrap runs above, the following results were obtained.
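
The leave-one-out construction can be sketched in Python as follows. The file names are hypothetical placeholders. The sketch also shows why all 3 seed files are present in all but 3 realizations: each seed file is the left-out sample exactly once.

```python
def jackknife_sets(files):
    """Yield (left_out, training_set) pairs, leaving out one file each time."""
    for i, left_out in enumerate(files):
        yield left_out, files[:i] + files[i + 1:]

files = [f"file_{i:02d}" for i in range(63)]   # N = 63 training files
seeds = {"file_00", "file_01", "file_02"}       # hypothetical seed file names

runs = list(jackknife_sets(files))
with_all_seeds = sum(1 for _, train in runs if seeds <= set(train))
print(len(runs), len(runs[0][1]), with_all_seeds)  # 63 62 60
```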
Table 3: Jackknife Results (Outlining Portion of Runs SOM Actually Placed File with
Presumed Group, Total Number of Runs = 63)

                                                         Presumed group    Actual group(s)   Percent of runs
Data file description                                    via manual and    found by          in each group
                                                         old SOM           jackknife
-------------------------------------------------------  ----------------  ----------------  ---------------
Training data
Ultimatum preceding attack                               3                 3                 100
News about LTTE working to peace                         1                 1/other           98.4/1.6
Thimpu declaration (objectives of LTTE)                  1                 1/other/2         95/3/2
Article about peace                                      1                 1                 100
Article about killing of humanitarian workers            2                 other/2           65/35
Speech from LTTE leader (2004), inflammatory language    1                 3/1               97/3
Speech from LTTE leader (1992), more conciliatory        1 or 3*           1/2/other/3       51/25/22.4/1.6

Testing data
Article about 2004 tsunami                               1                 1/other/2         85.7/11.1/3.2
LTTE sympathizer                                         1                 1/2/other         48/29/23
Blog accusing Sinhala Buddhists of racism                1                 1/2/other         46/30/24
Tamil cuisine article                                    1                 1/other/2         59/23.5/17.5
Denial of suicide attack by sympathizer                  1                 2/1               60/40

(All  training  data  was  selected  from  a  possible  set  of  100  files.  Thus,  in  the  “training 
data”  results  in  the  table  above,  the  file  specified  was  in  the  training  set,  but  may  not  have 
been  selected  by  the  algorithm  for  training  purposes.  For  this  reason,  we  would  expect 
the  algorithm  not  to  obtain  100%  correct  classification  for  all  files.  In  no  case  was  the 
“testing  data”  viewed  previously  by  the  SOM.) 

In the training set, with a few exceptions, the results seem to be consistent with those
obtained from the bootstrapping techniques. In the case of inflammatory writings (group
3), 22 files from the training set of 100 were placed in this group over 90% of the time
and 25 files over 50% of the time. Again, the results concurred with the previous runs.

The results with the testing data were unexpected. Essentially, the SOM was making a
separate grouping for the test data a significant portion of the time. The precise reasons
why the behavior of the clustering algorithm differed so much from the bootstrapping
runs are unknown. However, not a single file from the testing set was placed erroneously
into group 3 in any of the 63 jackknifing runs.
5.0  SUMMARY  AND  CONCLUSIONS


This proof-of-concept study explored the relationship between text written by
groups of interest and subsequent group phase changes. The
technique employed consisted of gathering writings from a group via the World Wide
Web and extracting words and phrases from the text. The resulting text data was
converted into a numerical matrix using MATLAB. The matrix was used to train a SOM,
and clustering was performed on the resulting SOM mapping. Once the collection of
web pages had been finalized, the process was entirely automated except at two points:
the examination of the master phrase list for redundancies, and the selection of a small
number of seed files for cluster groups.
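
The text-to-matrix step can be illustrated with a small sketch. The study performed this step in MATLAB; the Python version below is only an illustration, and the encoding it uses (raw, case-insensitive phrase counts) is an assumption, as are the phrase list and the example documents.

```python
def phrase_count_matrix(documents, phrases):
    """Return one row per document and one column per phrase, holding the
    number of case-insensitive, non-overlapping occurrences."""
    return [[doc.lower().count(p.lower()) for p in phrases] for doc in documents]

# Hypothetical master phrase list and documents, for illustration only.
phrases = ["peace talks", "ceasefire", "final warning"]
documents = [
    "The ceasefire held while peace talks resumed. Peace talks continue.",
    "This is a final warning: the ceasefire is over.",
]
for row in phrase_count_matrix(documents, phrases):
    print(row)
# [2, 1, 0]
# [0, 1, 1]
```

Each row of such a matrix would serve as one input vector when training the SOM.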

Unlike other work in the area, we did not presuppose anything about the significance of
words or phrases in the text. The master phrase list was generated automatically from the
training set of files, though it was manually inspected for redundancies. The SOM
learning algorithm was unsupervised, and aside from the selection of 3 seed files for
clustering purposes, the class designation was also unsupervised.

After an exploratory search, the Tamil Tigers (LTTE) were selected to serve as a study
group. The selection of the LTTE was based largely on the facts that they have exhibited
phase changes in the past, that an archive of their writings existed, and that they wrote
large amounts of text in English.

Data collection was conducted using a web-crawling spider downloaded from the
internet. Because of the limited quality of the spider, an inspection of the gathered
documents was required to weed out obviously unrelated documents, such as articles
about the Jewish Holocaust.

Initial results demonstrated that the SOM did a good job of distinguishing inflammatory
writings leading to violence from other negative writings, such as firm denials by the LTTE of
acts attributed to them. The SOM found inflammatory writings by the LTTE to be the
least linguistically diverse of the articles it analyzed.

In addition, the SOM seemed to do a good job of distinguishing inflammatory writings
actually attributed to the LTTE from inflammatory writings penned by sympathizers
maintaining web blogs. We did not set out to obtain this outcome, but it
was an interesting observation.

While a specific analysis of the inner workings of the SOM was not conducted, informal
observation seemed to indicate that its classification was based more on sentiment than on
individual word usage. For example, the number of times the word “kill” appeared in a
document did not seem to have a direct relationship to how that document was classified.

A statistical analysis via bootstrapping and jackknifing was performed to add credibility
to the findings. The statistical findings largely supported the initial SOM analysis.
Namely, the SOM technique did well at classifying inflammatory text leading to terrorist
attacks, articles with a more peaceful tilt, and inflammatory articles written by
sympathizers and not the Tigers. In some cases, the SOM placed these documents in the
correct group 100% of the time across the 200 bootstrapping runs and the 63 jackknifing
runs. The SOM technique did less well on documents whose classification could
be considered questionable by a human observer. For example, does an
article condemning the killing of humanitarian workers belong in the group that addresses
humanitarian issues, or in the inflammatory group? The algorithm generally placed it in
the humanitarian group, though not with complete certainty.

The SOM did a good job of detecting behavioral changes in group activities. The
availability of the archival record was inconsistent and sparse. We believe that with a
more consistent paper trail, we could detect major attitude shifts as well. If one were
continuously monitoring the internet for writings from a group, and periodically
retraining a SOM, predicting the larger shifts may be possible.

From this proof-of-concept exercise we did uncover the ability to identify writings
immediately preceding a terrorist attack. In fact, these writings were mapped to the same
cell in the statistical analysis over 2/3 of the time. Such an observation could
prove useful, as migration of data to this micro group on the map could serve as advance
warning that an attack is imminent.
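
One way such migration could be tracked is a sliding-window share of recent documents that map to the watched cell. The Python sketch below is a hypothetical illustration, not a component of the study: the function name, the window length, the cell coordinates, and the document stream are all assumptions.

```python
from collections import deque

def watch_cell_share(cell_stream, watch_cell, window):
    """Yield, for each new document, the share of the last `window`
    documents whose best-matching SOM cell equals watch_cell."""
    recent = deque(maxlen=window)
    for cell in cell_stream:
        recent.append(cell == watch_cell)
        yield sum(recent) / len(recent)

# Hypothetical stream of (row, col) cells for documents in time order.
stream = [(2, 3), (0, 1), (4, 4), (4, 4), (1, 1), (4, 4)]
shares = list(watch_cell_share(stream, watch_cell=(4, 4), window=3))
print(shares)
```

A rising share would indicate documents drifting toward the watched cell; what level of drift should count as a warning is not something this section establishes.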

One interesting data point we discovered in our analysis was that a document written by
the LTTE signaling a move toward peace preceded an attack by GoSL troops. A
worthwhile follow-on analysis would be a SOM-based study of the interplay between Tamil
Tiger and Sri Lankan government writings. The Sri Lankan government web sites are
still up and running, and an operation to collect the necessary data would be
straightforward.

This study sought to develop text analysis techniques to detect phase change in
organizational writings. The results suggest that an unsupervised
learning technique such as the SOM could be used not only to detect phase changes, but
also to provide advance warning of attacks and to classify whether a document is
genuinely written by the group or by a sympathizer. All the results were obtained
without editing the word and phrase list training data to emphasize sentiment or intent. If
used to continuously monitor the writings of a group and retrained periodically, the
SOM could prove a useful tool in practice.

Why should a decision maker be interested in this type of tool? We have demonstrated
that a signal may be present in text that allows us to map a group's writing to a subsequent
violent action. Such analysis may help to support cause-and-effect relationships found
among groups of interest. It can help provide leading indicators that precede events of
interest. More importantly, if one wants to influence the actions of a group or alter an
outcome, such a tool can help guide where and when one should make an investment.

At present, there are some limitations that we foresee with using adversarial text to
predict adversarial actions. We will never be able to use this tool to definitively predict
the actions of a group. We can only estimate the potential of a group to undergo a phase
change or activity. But there are several improvements that we can make.

Any operational system will need a mechanism that collects on-line text in real time,
determines its relevance, and passes the useful documents to the analysis tool. Acquiring
useful data was a hurdle in this project because separating appropriate,
signal-containing documents from the hundreds of documents that contained essentially noise
had to be done manually. For example, as was discussed above, a document
comparing the Tamil plight to the Jewish Holocaust caused the spider to begin collecting
documents on the Jewish Holocaust. Any future system development
should include a data acquisition and filtering tool to ensure that only
documents concerning the group in question are presented to the system.

In  addition  to  acquiring  quality  data,  there  is  also  the  issue  of  acquiring  a  quantity  of  data. 
Further  analysis  will  need  to  be  conducted  to  determine  the  amount  and  granularity  of 
data  that  is  required  to  provide  a  reasonable  threat  forecast. 

A  self-organizing  map  approach  was  selected  for  this  study  due  to  its  long  reputation  as  a 
useful  pattern  recognition  tool.  Other  mathematical  and  pattern  recognition  techniques 
may  be  better  suited  for  this  application  and  should  be  examined  as  possible  candidates  in 
any  future  exploration. 

Group dynamics occur across several dimensions. It is likely that a dataset will contain
several types of phase changes other than those of interest to operational personnel.
Further work is needed in this area to determine how well we can isolate one
dimension of interest. In pursuit of this goal, it may be useful to examine other types of
phase changes, such as the suggestion of future criminal activity (e.g., in the Enron email
data set).

One limitation of the phase change system may involve the boundaries of the data itself.
Although the internet is a relatively new medium, the use of only internet text to
suggest phase changes is somewhat limiting. People write and signal each other
using both newer and older media. Many conversations are not captured in text at all. The
incorporation of network logs of mobile phone usage into the data set, for example, has
the potential to greatly strengthen the system's ability to anticipate phase changes.

In the future, it would be useful to refine the technique to provide a graded scale of threat
level instead of a binary indicator (danger/no danger). A graded
metric would help answer important questions such as how fast a group is
approaching a phase change threshold. Further, metrics should be developed in a format
that makes the threat scale useful to operational personnel and decision makers.
Consultation with these persons to assess their requirements is essential for future
advancement.
LIST  OF  ACRONYMS

GoSL      Government of Sri Lanka
LTTE      Liberation Tigers of Tamil Eelam
MATLAB    MATrix LABoratory
SOM       Self-Organizing Map
TULF      Tamil United Liberation Front
U-matrix  Unified Distance Matrix
URLs      Uniform Resource Locators
US        United States